
    Time-domain Ad-hoc Array Speech Enhancement Using a Triple-path Network

    Deep neural networks (DNNs) are very effective for multichannel speech enhancement with fixed array geometries. However, it is not trivial to use DNNs for ad-hoc arrays with unknown order and placement of microphones. We propose a novel triple-path network for ad-hoc array processing in the time domain. The key idea in the network design is to divide the overall processing into spatial processing and temporal processing and use self-attention for spatial processing. Using self-attention for spatial processing makes the network invariant to the order and the number of microphones. The temporal processing is done independently for all channels using a recently proposed dual-path attentive recurrent network. The proposed network is a multiple-input multiple-output architecture that can simultaneously enhance signals at all microphones. Experimental results demonstrate the excellent performance of the proposed approach. Further, we present analysis to demonstrate the effectiveness of the proposed network in utilizing multichannel information even from microphones at far locations. Comment: Accepted for publication in INTERSPEECH 202
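
    As a quick illustration of why self-attention across the microphone axis yields order invariance, here is a minimal PyTorch-style sketch; the module name, tensor layout, and sizes are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Self-attention across the microphone axis of a (batch, mics, time,
    feature) tensor. With no positional encoding over microphones, the
    layer is permutation-equivariant and accepts any number of mics."""
    def __init__(self, feat_dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, x):                               # x: (B, M, T, F)
        B, M, T, F = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B * T, M, F)  # attend over mics
        y, _ = self.attn(x, x, x)
        return y.reshape(B, T, M, F).permute(0, 2, 1, 3)

# Permuting the microphones permutes the output identically.
layer = SpatialSelfAttention(feat_dim=64).eval()
x = torch.randn(2, 5, 10, 64)                           # 5 microphones
perm = torch.randperm(5)
with torch.no_grad():
    assert torch.allclose(layer(x)[:, perm], layer(x[:, perm]), atol=1e-5)
```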

    Perceptual Evaluation of Spatial Room Impulse Response Extrapolation by Direct and Residual Subspace Decomposition

    Six-degrees-of-freedom rendering of an acoustic environment can be achieved by interpolating a set of measured spatial room impulse responses (SRIRs). However, the involved measurement effort and computational expense are high. This work compares novel ways of extrapolating a single measured SRIR to a target position. The novel extrapolation techniques are based on a recently proposed subspace method that decomposes SRIRs into a direct part, comprising direct sound and salient reflections, and a residual. We evaluate extrapolations between different positions in a shoebox-shaped room in a multi-stimulus comparison test. Extrapolation using a residual SRIR and salient reflections that match the reflections at the target position is rated as perceptually most similar to the measured reference.
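
    A sketch of the underlying idea (a free-field simplification, not the paper's full method): once an SRIR is split into a direct part and a residual, the direct sound and each salient reflection can be re-timed and re-scaled for the target position while the residual is reused unchanged. All names and the 1/r propagation model below are assumptions:

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def shift_and_scale(h, delay_samples, gain):
    """Integer-sample delay with zero padding, followed by scaling."""
    out = np.zeros_like(h)
    if delay_samples >= 0:
        out[delay_samples:] = h[:len(h) - delay_samples]
    else:
        out[:delay_samples] = h[-delay_samples:]
    return gain * out

def extrapolate_direct(h_direct, src_pos, mic_pos, target_pos, fs):
    """Move a direct-sound (or image-source reflection) component from the
    measured position to the target position: a pure delay change plus a
    1/r amplitude change under a free-field assumption."""
    r_meas = np.linalg.norm(np.asarray(src_pos) - np.asarray(mic_pos))
    r_tgt = np.linalg.norm(np.asarray(src_pos) - np.asarray(target_pos))
    delay = int(round((r_tgt - r_meas) / C * fs))
    return shift_and_scale(h_direct, delay, r_meas / r_tgt)
```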

    Direct and Residual Subspace Decomposition of Spatial Room Impulse Responses

    Psychoacoustic experiments have shown that directional properties of the direct sound, salient reflections, and the late reverberation of an acoustic room response can have a distinct influence on the auditory perception of a given room. Spatial room impulse responses (SRIRs) capture those properties and thus are used for direction-dependent room acoustic analysis and virtual acoustic rendering. This work proposes a subspace method that decomposes SRIRs into a direct part, which comprises the direct sound and the salient reflections, and a residual, to facilitate enhanced analysis and rendering methods by providing individual access to these components. The proposed method is based on the generalized singular value decomposition and interprets the residual as noise that is to be separated from the other components of the reverberation. Large generalized singular values are attributed to the direct part, which is then obtained as a low-rank approximation of the SRIR. By advancing from the end of the SRIR toward the beginning while iteratively updating the residual estimate, the method adapts to spatio-temporal variations of the residual. The method is evaluated using a spatio-spectral error measure and simulated SRIRs of different rooms, microphone arrays, and ratios of direct sound to residual energy. The proposed method yields lower errors than existing approaches in all tested scenarios, including a scenario with two simultaneous reflections. A case study with measured SRIRs shows the applicability of the method under real-world acoustic conditions. A reference implementation is provided.
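
    The core low-rank idea can be sketched in a few lines of numpy. Here a plain SVD stands in for the generalized SVD used in the paper, and the backward-iterating residual update is omitted; function and variable names are illustrative:

```python
import numpy as np

def direct_residual_split(frame, rank):
    """Split an SRIR frame (mics x samples) into a low-rank 'direct' part
    (direct sound plus salient reflections) and a residual. Plain SVD is a
    simplified stand-in for the paper's generalized SVD, which additionally
    whitens the frame against a running residual estimate."""
    U, s, Vt = np.linalg.svd(frame, full_matrices=False)
    direct = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return direct, frame - direct

frame = np.random.randn(32, 256)   # 32-channel array, 256-sample frame
direct, residual = direct_residual_split(frame, rank=2)
```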

    Spatial Subtraction of Reflections from Room Impulse Responses Measured With a Spherical Microphone Array

    We propose a method for the decomposition of measured directional room impulse responses (DRIRs) into prominent reflections and a residual. The method comprises obtaining a fingerprint of the time-frequency signal that a given reflection carries, imposing this time-frequency fingerprint on a plane-wave prototype that exhibits the same propagation direction as the reflection, and finally subtracting this plane-wave prototype from the DRIR. Our main contributions are the formulation of the problem as a spatial subtraction as well as the incorporation of order truncation, spatial aliasing and regularization of the radial filters into the definition of the underlying beamforming problem. We demonstrate, based on simulated as well as measured array impulse responses, that our method increases the accuracy of the model of the reflection under test and consequently decreases the energy of the residual that remains in a measured DRIR after the spatial subtraction.
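
    At its core, the subtraction removes the projection of the multichannel signal onto a plane-wave prototype in each time-frequency bin. The sketch below shows only this core step; the order truncation, spatial aliasing, and regularized radial filters that the paper builds into the beamforming problem are omitted, and all names are illustrative:

```python
import numpy as np

def subtract_plane_wave(X, steering):
    """Remove the plane-wave component arriving from a known direction from
    one multichannel STFT bin. X: (mics,) complex bin values; steering:
    (mics,) plane-wave steering vector for this bin and direction. The
    beamformer output is the time-frequency 'fingerprint' of the
    reflection; imposing it on the steering vector gives the prototype
    that is then subtracted."""
    d = steering / np.linalg.norm(steering)
    fingerprint = d.conj() @ X      # matched (delay-and-sum) beamformer
    return X - fingerprint * d      # subtract projection onto the prototype
```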

    Topological Sound Propagation with Reverberation Graphs

    Reverberation graphs are a novel approach for estimating global sound pressure decay and auralizing corresponding reverberation effects in interactive virtual environments. We use a 3D model to represent the geometry of the environment explicitly, and we subdivide it into a series of coupled spaces connected by portals. Off-line geometrical-acoustics techniques are used to precompute transport operators, which encode pressure decay characteristics within each space and between coupling interfaces. At run-time, during an interactive simulation, we traverse the adjacency graph corresponding to the spatial subdivision of the environment. We combine transport operators along different sound propagation routes to estimate the pressure decay envelopes from sources to the listener. Our approach compares well with off-line geometrical techniques, but computes reverberation decay envelopes at interactive rates, ranging from 12 to 100 Hz. We propose a scalable artificial reverberator that uses these decay envelopes to auralize reverberation effects, including room coupling. Our complete system can render as many as 30 simultaneous sources in large dynamic virtual environments.
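
    A toy example of the run-time traversal (data and names invented for illustration): reduce each transport operator to a broadband gain and delay, enumerate routes through the adjacency graph, and accumulate one term per route; summing the delayed, scaled per-space decay envelopes then approximates the coupled decay:

```python
# Spaces as graph nodes, portals as edges; transport operators reduced to
# (gain, delay in seconds) for illustration.
graph = {"hall": ["corridor"], "corridor": ["hall", "office"], "office": ["corridor"]}
transport = {("hall", "corridor"): (0.5, 0.012),
             ("corridor", "office"): (0.3, 0.020)}

def routes(src, dst, path=()):
    """Enumerate cycle-free routes from src to dst in the adjacency graph."""
    path = path + (src,)
    if src == dst:
        yield path
        return
    for nxt in graph[src]:
        if nxt not in path:
            yield from routes(nxt, dst, path)

def decay_terms(src, dst):
    """One (gain, delay) term per propagation route from source space to
    listener space, obtained by composing the per-portal operators."""
    for path in routes(src, dst):
        gain, delay = 1.0, 0.0
        for a, b in zip(path, path[1:]):
            g, d = transport[(a, b)]
            gain, delay = gain * g, delay + d
        yield gain, delay

print(list(decay_terms("hall", "office")))   # [(0.15, 0.032)]
```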

    Towards Improved Room Impulse Response Estimation for Speech Recognition

    We propose to characterize and improve the performance of blind room impulse response (RIR) estimation systems in the context of a downstream application scenario, far-field automatic speech recognition (ASR). We first draw the connection between improved RIR estimation and improved ASR performance, as a means of evaluating neural RIR estimators. We then propose a GAN-based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features, and uses a novel energy decay relief loss to optimize for capturing energy-based properties of the input reverberant speech. We show that our model outperforms the state-of-the-art baselines on acoustic benchmarks (by 72% on the energy decay relief and 22% on an early-reflection energy metric), as well as in an ASR evaluation task (by 6.9% in word error rate).
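
    The energy decay relief (EDR) that the loss is built on is the Schroeder backward integration applied per STFT frequency band. A minimal numpy/scipy sketch, assuming a simple mean-absolute-dB comparison rather than the authors' exact formulation:

```python
import numpy as np
from scipy.signal import stft

def energy_decay_relief(rir, fs, nperseg=256):
    """EDR in dB: backward-integrated STFT energy per frequency band."""
    _, _, Z = stft(rir, fs=fs, nperseg=nperseg)
    energy = np.abs(Z) ** 2                             # (freq, time)
    edr = np.cumsum(energy[:, ::-1], axis=1)[:, ::-1]   # integrate t..end
    return 10.0 * np.log10(edr + 1e-12)

def edr_loss(rir_est, rir_ref, fs):
    """Mean absolute EDR difference in dB (a sketch of the loss idea)."""
    return np.mean(np.abs(energy_decay_relief(rir_est, fs)
                          - energy_decay_relief(rir_ref, fs)))
```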

    Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

    Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people ("egos") move in a scene and talk among themselves, they receive rich audio-visual cues that can help uncover the unseen areas of the scene. Given the high cost of continuously processing egocentric visual streams, we further explore how to actively coordinate the sampling of visual information, so as to minimize redundancy and reduce power use. To that end, we present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space. We evaluate the approach using a state-of-the-art audio-visual simulator for 3D scenes as well as real-world video. Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff. Project: http://vision.cs.utexas.edu/projects/chat2map. Comment: Accepted to CVPR 202
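
    The sampling trade-off can be made concrete with a small sketch: a policy scores whether powering the camera this step is worth its cost, and the reward balances mapping gain against that cost. Architecture, sizes, and the reward shape are assumptions, not the authors' design:

```python
import torch
import torch.nn as nn

class CameraPolicy(nn.Module):
    """Given fused audio-visual state features, output the probability of
    turning the camera on for the current step (illustrative sketch)."""
    def __init__(self, state_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, state):
        return torch.sigmoid(self.net(state))

def reward(map_quality_gain, camera_on, cost=0.01):
    """Cost-penalized reward: mapping-accuracy gain minus a per-frame
    capture cost (both terms are placeholders)."""
    return map_quality_gain - cost * float(camera_on)
```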

    SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

    We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, mapping, source localization and separation, and acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the advantages of allowing continuous spatial sampling, generalization to novel environments, and configurable microphone and material properties. To our knowledge, this is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough to use for embodied learning. We showcase the simulator's properties and benchmark its performance against real-world audio measurements. In addition, we demonstrate two downstream tasks -- embodied navigation and far-field automatic speech recognition -- and highlight sim2real performance for the latter. SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear. Comment: Camera-ready version. Website: https://soundspaces.org. Project page: https://vision.cs.utexas.edu/projects/soundspaces
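
    For the far-field ASR use case, rendered impulse responses are typically convolved with clean speech to synthesize reverberant training data. A minimal sketch of that step; the file names are hypothetical, and any RIR rendered by the simulator for a chosen source/microphone placement would do:

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, fs = sf.read("clean_utterance.wav")        # hypothetical inputs
rir, fs_rir = sf.read("simulated_rir.wav")
assert fs == fs_rir, "sampling rates must match"

reverberant = fftconvolve(speech, rir)[: len(speech)]
reverberant /= np.max(np.abs(reverberant)) + 1e-9  # normalize, avoid clipping
sf.write("farfield_utterance.wav", reverberant, fs)
```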